ABSTRACT


Graph data such as argument diagrams has become increasingly
common in EDM. Augmented Graph Grammars are a
robust rule formalism for graphs. Prior research has shown
that hand-authored graph grammars can be used to automatically
grade student-produced argument diagrams. But
hand-authored rules can be time consuming and expensive
to produce, and they may not generalize well to novel contexts.
We applied Evolutionary Computation to automatically
induce empirically-valid graph grammars for argument
diagrams that can be used for automatic grading or provide
the basis for hints. Our results show that our approach can
generate more relevant rules than experts or other state-of-the-art
algorithms, and that these evolved rules outperform
the alternatives.


Keywords 
Evolutionary Computation, Augmented Graph Grammars, 
Argument Diagramming, Feature Engineering 


1. INTRODUCTION 


Intelligent tutoring systems and computer-supported collab- 
oration platforms have grown increasingly popular in recent 
years. As they have grown in popularity they have also been 
applied in increasingly complex domains such as argumen- 
tation [14], legal reasoning [22] and writing [6]. MOOCs and 
other online educational platforms have also grown in pop- 
ularity yielding large repositories of user-system interaction 
logs [10], and classical tutors and educational games have 
grown more common in classrooms yielding large reposito- 
ries of student data [13]. Much of this data can be repre- 
sented as rich graph structures such as argument diagrams 
[17] or interaction networks [7]. 


Despite the increasing prevalence of graph data, compara- 
tively little work has been done on automatically evaluating 
student-produced graphs or graph logs. In prior work we
demonstrated that hand-authored Graph Grammars can be
used as features to automatically grade student-produced
argument diagrams [16, 17]. But hand-authoring complex
rules is time consuming and expensive, and the resulting rules
do not generalize well to novel contexts. Other authors have
developed analytical tools tuned to path analysis [24, 3]; however,
these are tailored to a specific task. Other more general-purpose
algorithms (e.g. [30, 5]) have limitations and are unsuited to
the induction of generalized rules that use negation or other 
complex elements. Therefore it has not yet been shown that 
it is possible to automatically induce complex, empirically- 
valid, rules for rich graph structures that are comparable to 
rules produced by domain experts. 


In this paper we will describe our work on the automatic in- 
duction of Augmented Graph Grammars for student-produced 
argument diagrams. Our goal in this work is to explore ways 
to automatically induce empirically-valid graph rules that 
can be used as features for automatic grading and which 
can provide the basis for hints. While our previous work 
was focused on inducing positive rules in [33] and in [19], 
in this work we applied Evolutionary Computation (EC) to 
induce both positive and negative rules for student graphs 
that incorporate more complex elements such as negation 
and generalized types. Additionally, in our previous work we 
compared the induced rules with a small number of expert 
rules, while in this work we will compare our induced rules
to a full set of complex rules authored by domain experts
and to rules produced by other state-of-the-art induction
algorithms.


2. BACKGROUND 


2.1 Argument Diagrams 

Argument diagrams are semi-formal graphical representa- 
tions that reify key features of arguments such as hypothesis 
statements, claims, and citations as nodes and the support- 
ing, opposing, and clarification relationships between them 
as arcs. Argument diagrams directly connect the syntax 
of the argument representation to the underlying semantics 
thus making it clear and computationally tractable. Argu- 
ment diagrams can serve to make the often implicit structure 
of an argument salient to students while also constraining 
them to make relevant contributions [29]. Prior researchers 
have shown that argument diagrams can be used to scaffold 
students’ understanding of existing arguments [12, 8]; can 
frame collaborative learning [26]; and can help to support 
scientific reasoning [29]. 
Figure 1: A student-produced Argument Diagram.


A sample student-produced diagram is shown in Figure 1. 
The diagram includes a central research claim node, which 
has a single text field indicating the content of the research 
claim. A set of citation nodes are connected to the claim 
node via supporting, opposing and undefined arcs colored 
green, red, and blue respectively. Each citation contains two 
fields: one for the citation information, and the other for a 
summary of the work; each arc has a single text field explaining
what purpose the relationship serves. At the bottom of
the diagram, there is a single isolated hypothesis node that
contains two text fields, one for a conditional or IF field,
and the other for a consequence or THEN field.


2.2 Augmented Graph Grammars 

Graph Grammars are a graph-based representation for rules 
about graphs that are analogous to string grammars. Graph 
grammar rules are composed of standard graph elements 
such as nodes and directed or undirected arcs. As with string 
grammars they are defined by a finite alphabet of basic or 
ground node and arc types as well as a set of production 
rules for variable elements. A single graph rule defines a 
space or class of matching graphs. Graph grammars can be 
used to generate graphs from an initial seed via recursive 
rule applications where each variable element expands to a 
larger subgraph. They can also be used to match graphs 
in a layered fashion by first mapping all ground elements to 
individual nodes or arcs and then recursively matching the 
sub-elements. Graph grammars have been used for analysis 
and graph transformation in domains such as visual pro- 
gramming [9] and mechanism analysis [27]. 


Augmented Graph Grammars are an extension of traditional
graph grammars that allow us to match rich graphs with
complex node and arc types that contain sub-elements, text,
and other variable structures [15]. Augmented Graph Grammars
also support: negated elements which select for the
nonexistence of subgraphs; generalized node and arc types 
t.Type = “claim” or “hypothesis”
a.Type = “citation” 
b.Type = “citation” 
c.Type = “comparison” 


Figure 2: A simple augmented graph grammar rule 
that detects uncompared counterarguments. 


which match multiple items; complex element constraints 
which allow us to compare individual elements; complex 
graph expressions which allow for universal and existential 
quantification; and the incorporation of NLP rules or other 
external features. As such they are an ideal rule represen- 
tation for the analysis of argument diagrams, user-system 
interaction logs, and other educational data. 


A sample rule is shown in Figure 2. This rule is designed
to identify cases of uncompared counterarguments, that is:
there is an opposing arc O from the citation a to the node t
and also a supporting arc S from the citation b to the node
t; however, there exists no comparison arc between the two
citations a and b. This is designated by the negated arc ¬c.
Here node t is either a claim or hypothesis. The variable
elements O and S are defined by recursive production rules
which are not shown. Those rules define supporting paths
as chains of supporting arcs and opposing paths as chains of
supporting arcs with an odd number (including one) of
opposing arcs.
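
As a rough illustration of what the rule in Figure 2 selects for, the following Python sketch checks a diagram for uncompared counterarguments. It is a simplification under stated assumptions, not the AGG engine: it only considers direct opposing and supporting arcs rather than the recursive paths O and S, it treats a comparison arc in either direction as a comparison, and the function name and the dictionary and triple encodings are hypothetical.

```python
from itertools import permutations


def uncompared_counterarguments(nodes, arcs):
    """Return (a, b, t) triples where citation a opposes t, citation b
    supports t, and no comparison arc joins a and b.

    nodes: dict mapping node id -> type string (hypothetical encoding)
    arcs:  iterable of (source, target, type) triples (hypothetical encoding)
    """
    arc_set = set(arcs)
    citations = [n for n, kind in nodes.items() if kind == "citation"]
    targets = [n for n, kind in nodes.items() if kind in ("claim", "hypothesis")]
    matches = []
    for t in targets:
        for a, b in permutations(citations, 2):
            opposes = (a, t, "opposing") in arc_set
            supports = (b, t, "supporting") in arc_set
            # Negated arc: the rule matches only when no comparison arc exists.
            # Assumption: comparison arcs count in either direction.
            compared = ((a, b, "comparison") in arc_set
                        or (b, a, "comparison") in arc_set)
            if opposes and supports and not compared:
                matches.append((a, b, t))
    return matches
```

Adding a comparison arc between the two citations removes the match, which mirrors the role of the negated arc in the rule.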
This example rule was designed by a domain expert in argumentation.
It is designed to identify cases where a stu-
dent has presented conflicting background information but 
has made no attempt at resolution. This is a critical struc- 
tural flaw that is commonly found in student-produced ar- 
guments. Students at all levels frequently absorb the lesson 
that they must show conflicting citations but routinely fail to 
explain those citations or to resolve the differences in a way 
that clarifies their own argument. As we have shown pre- 
viously such expert-designed rules can be empirically-valid 
and predictive of student performance [16]. However manu- 
ally designing rules can be both costly and inefficient. 


Thus our goal is to automatically induce meaningful rules:
rules that highlight structural flaws or argumentation errors;
rules that generalize beyond basic types; and rules that include
negated elements (detecting non-existing cases).


2.3 Graph Grammar Induction

Current grammar induction algorithms fall into one of two 
broad categories: frequent subgraph matching, or graph 
compression. Frequent subgraph algorithms include Yan 
and Han’s gSpan algorithm [32], Inokuchi’s AGM [1], and
the FSG algorithm [20]. These algorithms carry out con- 
trolled graph walks to identify common structures. They 
are quite effective, particularly in grounded domains such 
as cheminformatics where the graphs, in this case molecu- 
lar models, have low degree and exact matches are required. 
However, the algorithms do not support disjoint subgraphs,
negation, or generalized elements. While we could, in theory,
insert explicit negation arcs, doing so would expand the size of
the graphs exponentially and thus make any search process
intractable. Similarly, while we could replace individual elements
with generalized forms, that would simply force the
system to use a smaller range of types and would not allow
for context-sensitive generalization of elements. These
algorithms are also ill-suited for identifying errors as the 
search process is strictly unsupervised and finds frequently- 
occurring structures without reference to external weights. 


Graph compression algorithms such as Subdue take a differ- 
ent approach to the problem. Subdue is a recursive beam- 
search algorithm that generates a hierarchical grammar by 
recursive collapse based upon the MDL principle [5]. Sub- 
due operates by iteratively identifying the most frequently 
occurring arc in the graph and then reducing it to a new 
variable node. Unlike gSpan the resulting grammar is hier- 
archical and the beam search process can be used for super- 
vised learning given a suitable set of positive and negative 
examples [11]. The candidate graphs are ranked according 
to a normalized error metric: 


$$\text{Error} = \frac{\text{PosGraphsNotCovered} + \text{NegGraphsCovered}}{\text{TotalExamples}}$$


While Subdue is more flexible than the frequentist approaches 
it too does not support generalized elements, negation, or 
disjoint subgraphs. 
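
For concreteness, the normalized error metric used to rank candidates can be written as a small Python function. This is only a sketch of the formula as stated in the text; the function and argument names are hypothetical and are not taken from the Subdue code base.

```python
def subdue_error(pos_not_covered, neg_covered, total_examples):
    """Normalized error: misclassified examples divided by all examples.

    pos_not_covered: positive graphs the candidate fails to match
    neg_covered:     negative graphs the candidate matches
    total_examples:  total number of positive and negative graphs
    """
    if total_examples <= 0:
        raise ValueError("total_examples must be positive")
    return (pos_not_covered + neg_covered) / total_examples
```

A candidate that misses 3 of the positive graphs and matches 1 negative graph out of 20 examples would score 4/20 = 0.2; lower values indicate better candidates.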


2.4 Related Work 


We have previously shown that domain experts can hand au- 
thor augmented graph grammars that are empirically-valid 
and which can be used as features in a regression model 
to automatically grade student-produced diagrams [16, 17]. 


In more recent experiments we have also shown that it was 
possible to apply EC to induce graph grammars that are 
positively correlated with argument grades and that we can 
apply χ²-filtering to select unique rules from the large space
of candidates [19]. We were also able to show that the in- 
duced rules outperformed rules generated by both Subdue 
and gSpan and outperformed similar expert rules that fit 
into the limited rule space. The rules produced in that study, 
however, were limited in scope. While they supported dis- 
joint graphs, they did not identify errors, and did not sup- 
port generalized elements or negation. In this work we will 
build upon these results to include generalization and nega- 
tion, and we will compare the resulting rules to a full set of 
77 hand-coded expert rules. 


3. METHODS 


We conducted two experiments on the induction of Aug- 
mented Graph Grammars using EC. First we applied EC to 
induce graph rules composed of static node and arc types 
that were both positively and negatively correlated with the 
overall argument quality. That is, we sought to identify 
ground rules that either highlighted good features of arguments
(positive) or matched structural flaws (negative).
We then compared them to expert-produced rules and to 
rules induced by the Subdue and gSpan algorithms. In our 
second experiment we applied EC to induce rules that also 
incorporated generalized nodes as well as negated arcs (de- 
tecting non-existing cases). We describe them below. 


Evolutionary Computation is a general beam-search algo- 
rithm based upon Natural Selection. The EC algorithm be- 
gins with a population of candidate solutions in a shared 
solution representation. This population may be randomly 
generated or supplied by the user. The individual solutions 
are then ranked by means of a fitness function which may 
be an absolute performance metric or a form of tournament 
selection. The next generation of the population is then 
formed by a combination of fitness proportional selection, 
crossover or recombination of candidate solutions, random 
mutation of solutions, and elitist cloning. EC algorithms 
proceed iteratively until a given fitness threshold is reached 
or a fixed number of generations has passed. EC has been 
used in a number of applications such as tuning Neural Net- 
works [21], and evolving computer code [2]. 


EC has a number of advantages over other special-purpose
induction algorithms. Firstly, it is very flexible: the behavior
of the system is determined by the user-specified solution
representation and the genetic operators. This makes it easy
to tune the behavior of the system to include new types of
elements or to test out alternative inductive biases. Secondly,
EC is very robust: the basic algorithm can be applied
in a wide range of domains and it can be used in areas where
the contours of the search space are unknown. There are a
number of widely-available EC systems. For the purposes
of this research we used pyEC, an open-source EC engine
[18], coupled with AGG, an engine for graph matching using
Augmented Graph Grammars [15].


The rules induced in Experiment I consisted entirely of ground
nodes and arcs while the rules induced in Experiment II included
generalized node types and negated comparisons as
shown in Figure 2. For both experiments we assessed the
fitness of the rules using the same nonparametric frequency
correlation that we discussed in Subsection 2.4 with the target
values being maximized or minimized depending upon
the experimental goals.


Figure 3: Canonical matrices for crossover.


Mutation in the EC algorithm is a general-purpose opera- 
tion that is designed to promote exploration by introducing 
heterogeneity into the population. For this set of experi- 
ments we applied basic point mutation that added, deleted, 
or modified individual graph elements (see [33, 19]). Here 
mutation occurred with a small constant frequency when 
individuals were added to each population. 
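
The point mutation described above can be sketched in Python as follows. This is an illustrative approximation, not the operator used in our engine: the rule encoding (a dictionary of typed nodes and typed arcs), the choice of four equally likely operations, and the name `point_mutate` are all assumptions made for the example. The type lists mirror the diagram ontology.

```python
import random

NODE_TYPES = ["citation", "claim", "currstudy", "hypothesis"]
ARC_TYPES = ["supporting", "opposing", "comparison", "undefined"]


def point_mutate(rule, rng):
    """Return a copy of `rule` with at most one element added, deleted, or retyped.

    rule: {"nodes": {id: type}, "arcs": {(src, dst): type}}  (hypothetical encoding)
    rng:  a random.Random instance
    """
    nodes = dict(rule["nodes"])
    arcs = dict(rule["arcs"])
    op = rng.choice(["add_node", "add_arc", "delete", "modify"])
    if op == "add_node" or not nodes:
        nodes[max(nodes, default=-1) + 1] = rng.choice(NODE_TYPES)
    elif op == "add_arc":
        src, dst = rng.choice(list(nodes)), rng.choice(list(nodes))
        if src != dst:  # self-loops are skipped, so this may be a no-op
            arcs[(src, dst)] = rng.choice(ARC_TYPES)
    elif op == "delete":
        if arcs and rng.random() < 0.5:
            del arcs[rng.choice(list(arcs))]
        elif len(nodes) > 1:
            victim = rng.choice(list(nodes))
            del nodes[victim]
            # Remove arcs that touched the deleted node.
            arcs = {k: v for k, v in arcs.items() if victim not in k}
    else:  # modify: retype one existing arc or node
        if arcs and rng.random() < 0.5:
            arcs[rng.choice(list(arcs))] = rng.choice(ARC_TYPES)
        else:
            nodes[rng.choice(list(nodes))] = rng.choice(NODE_TYPES)
    return {"nodes": nodes, "arcs": arcs}
```

The operator copies its input, so the parent rule is left intact and can still be cloned or recombined in the same generation.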


For these experiments we employed stable matrix crossover
based upon the work of Stone, Pillmore, & Cyre [28] illustrated
in Figure 3. In this form of crossover we select
a pair of parent graphs using fitness-proportional selection
and represent them as adjacency matrices (Po). The nodes
are represented by letters on the rows and columns, while
the arcs are represented by the numbered cells within the
table. Empty cells indicate the absence of an arc. The order
of elements in the matrices is canonical and is determined
by the order in which the nodes were added to the rule.


On crossover we align the nodes and arcs in the parent matrices
and then randomly shuffle the nodes and arcs between
them based upon a series of coin tosses to produce the two
children (Co). Any constraints that are attached to an individual
element are copied with it. Matrix crossover always
produces two children that match the size of their parents 
with all excess elements being copied directly to the larger 
of the two offspring. Table 1 shows this crossover process at 
the graph level. By design crossover is an adaptive process 
that is designed to promote homogeneity and to preserve 
good building blocks or partial solutions called introns [2]. 
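
A minimal Python sketch of this style of crossover is given below. It is a simplified reading of the description above, not the operator of Stone, Pillmore, & Cyre: each rule is assumed to be a pair of a node-type list in canonical order and a square adjacency matrix whose cells hold an arc type or `None`, and the name `matrix_crossover` is hypothetical.

```python
import random


def matrix_crossover(p0, p1, rng):
    """Cross two rules given as (nodes, matrix) pairs and return two children.

    nodes:  list of node types in canonical (insertion) order
    matrix: square list of lists; matrix[i][j] is an arc type or None
    rng:    a random.Random instance
    """
    # Each child starts as a copy of one parent, so the excess rows and
    # columns of the larger parent carry over unchanged to the larger child.
    c0 = (list(p0[0]), [row[:] for row in p0[1]])
    c1 = (list(p1[0]), [row[:] for row in p1[1]])
    shared = min(len(p0[0]), len(p1[0]))
    for i in range(shared):
        if rng.random() < 0.5:  # coin toss for the aligned node i
            c0[0][i], c1[0][i] = c1[0][i], c0[0][i]
        for j in range(shared):
            if rng.random() < 0.5:  # coin toss for the aligned arc cell (i, j)
                c0[1][i][j], c1[1][i][j] = c1[1][i][j], c0[1][i][j]
    return c0, c1
```

Because elements are only exchanged between aligned positions, each child keeps the size of the parent it was copied from.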


4. DATA 


Our experimental analysis was based upon two previously- 
collected datasets. The first is a set of student-produced 
argument diagrams for empirical research reports. The sec- 
ond is a repository of hand-authored rules defined by domain 
experts. Both datasets were collected as part of our prior 
work on the diagnosticity of argument diagrams [16, 17]. 
Table 1: Graphical representation for crossover.


4.1 Argument Data 

Our repository of argument diagrams was collected at the 
University of Pittsburgh in a course on Psychological Re- 
search Methods. Students in the course learn about design- 
ing, conducting, and reporting on empirical research. The 
course has a significant writing component. Students com- 
plete two research projects over the course of the semester 
both of which result in a written report modeled on a confer- 
ence publication. They are allowed to work on the projects 
individually or as a team of two. For the purposes of our 
study, the students were required to plan their written argu- 
ments graphically before writing them. The diagrams were 
authored using LASAD, an online tool for argument dia- 
gramming and collaboration [14]. The diagramming ontol- 
ogy contained four types of nodes: citation, claim, current 
study and hypothesis; and four types of arcs: supporting, 
opposing, comparison, and undefined. Currstudy nodes are 
used to represent factual information about the study such 
as the target population. Undefined arcs represent cases 
where nodes provide clarification or concept definitions. 


After removing dropouts and one diagram containing a sin- 
gle node, we collected a set of 104 paired diagrams and es- 
says from the course. These diagrams and essays were in- 
dependently graded by an experienced TA according to a 
parallel rubric with 14 questions that were focused on the 
argument’s quality, coherence, use of citations, and other 
criteria. In this work we will focus on the gestalt grades 
for overall graph and essay quality. The gestalt grades were 
assigned on an 11 point scale from -5 (worst quality) to +5 
(complete, coherent, and persuasive) at 3 point intervals. 
This same dataset was used in our prior work [19]. 


4.2 Expert Rules

In parallel with data collection, we also collaborated with a 
group of domain experts to define a set of 77 a-priori argu- 
ment rules. These rules were designed to identify individ- 
ual features of argument diagrams or sub-graphs that were 
consistent with high quality argumentation or which repre- 
sented structural flaws. Thirty-four of these rules focused on 
basic features such as the size or order of the diagram, the 
average number of parents and children, or the presence of 
empty elements. The remainder were complex rules that de- 
scribed the relationship between elements or matched larger 
graph structures such as the uncompared counterarguments 
shown in Figure 2. These rules included features that dealt 
with the text inside the elements, appropriate grounding of 
hypotheses or claims in citations, connectedness of the dia- 
gram, and the appropriate use of individual elements. 
In prior work we evaluated whether or not these rules were
empirically-valid. That is whether or not they correlated 
with the independently-assigned diagram grades and whether 
or not they could be used to predict the paired essay grades 
[16, 17]. In that work we assessed the validity of each indi- 
vidual rule by testing the correlation between the observed 
rule frequency on each diagram and the final graph or essay 
grade. The strength of this correlation was assessed using 
Spearman’s ρ, a nonparametric correlation measure [31]. We
found that most, but not all of the rules were strongly cor- 
related with the grades. We also found that some of the cor- 
relations ran counter to the experts’ a-priori expectations. 
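
The validity check described above, correlating a rule's per-diagram match frequency with the assigned grade, can be illustrated with a small self-contained Python implementation of Spearman's ρ. This is a sketch for exposition only: the function names are hypothetical, ties receive average ranks, and returning 0.0 for a constant input is an assumption made here rather than part of the original analysis.

```python
def average_ranks(values):
    """Assign 1-based ranks to `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(frequencies, grades):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(frequencies), average_ranks(grades)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant feature is treated as uncorrelated
    return cov / (var_x * var_y) ** 0.5
```

Here `frequencies` would hold the number of times a rule matches each diagram and `grades` the corresponding graph or essay grades, in the same order.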


5. EXPERIMENTS 


In this work we induced sets of baseline rules using the Sub- 
due and gSpan algorithms. We also conducted two sets 
of evolutionary experiments designated EC-Base and EC- 
General. The rules from each of these experiments were 
compared to assess their overall performance. 


Subdue: For these experiments we used Subdue V5 [4] in 
supervised learning mode to induce rules that were positively 
and negatively correlated with the overall graph and essay 
grades. In order to induce positively correlated rules we 
partitioned the graphs into positive and negative examples 
based upon their graph or paired essay score. All graphs 
with a grade of 0 or more were treated as positive exam- 
ples, and all graphs with a negative grade were treated as 
negative examples. We then ran the system to extract the 
12 best rules. In order to induce negatively-correlated rules 
we reversed the assignment, with graphs that were graded less
than or equal to 0 being treated as positive examples and
all others being treated as negative. We experimented with 
more restrictive thresholds > 0 and < 0 and found the per- 
formance did not improve. 
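
The partitioning step can be summarized in a few lines of Python. This sketch only restates the thresholds given above; the function name, the `target` argument, and the dictionary of grades are hypothetical and do not correspond to Subdue's own input format.

```python
def partition_examples(grades, target="positive"):
    """Split graph ids into (positive_examples, negative_examples).

    grades: dict mapping graph id -> grade on the -5 to +5 scale
    target: "positive" to induce positively correlated rules (grade >= 0 is
            a positive example), "negative" for the reversed assignment
            (grade <= 0 is a positive example)
    """
    if target == "positive":
        is_positive = lambda g: g >= 0
    elif target == "negative":
        is_positive = lambda g: g <= 0
    else:
        raise ValueError("target must be 'positive' or 'negative'")
    positives = [gid for gid, g in grades.items() if is_positive(g)]
    negatives = [gid for gid, g in grades.items() if not is_positive(g)]
    return positives, negatives
```

Note that a graph graded exactly 0 counts as a positive example under both assignments, as in the description above.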


gSpan: In this experiment we used gSpan v6 [34]. The soft- 
ware runs in strictly unsupervised mode where it returns all 
subgraphs whose frequency exceeds a user-specified thresh- 
old. In this case we ran the software over our dataset and 
collected all rules that exceeded a 1% threshold and then
ranked the candidate rules based upon their ρ value to identify
the most positive and negative examples.


EC-Base: In this experiment, we conducted a series of 
six evolutionary runs that were tuned to induce negatively- 
correlated rules. Three of those runs used the graph grade 
as a target and three used the essay grade. In each case 
we used a fixed population size of 100 individuals and ran 
the algorithm for 1,000 generations. In each generation, we 
cloned the top 10 individuals directly into the next genera- 
tion under elitism. We selected 10 individuals for point mu- 
tation and the remaining 80 individuals for crossover, then 
we copied the results over to the next generation. Fitness 
values were assigned using a fixed measure of -ρ for each
individual rule. The initial populations were composed of 
randomly-generated individuals containing 3 - 10 elements 
each. The nodes and arcs were all ground elements and 
were selected from a predefined ontology of basic types that 
matched the types used in the argument diagrams. 
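
One generation of the EC-Base configuration (10 elite clones, 10 mutated individuals, and 80 individuals from crossover in a population of 100) might look like the following Python sketch. It is not the pyEC implementation: the function name, the callback signatures, the use of fitness-proportional selection for the mutation parents, and the shift that keeps selection weights positive when fitness values such as -ρ are negative are all assumptions made for illustration.

```python
import random


def next_generation(population, fitness, mutate, crossover, rng,
                    n_elite=10, n_mutate=10):
    """Build the next population from elitism, point mutation, and crossover.

    fitness:   callable individual -> float (e.g. -rho for negative rules)
    mutate:    callable (individual, rng) -> individual
    crossover: callable (parent0, parent1, rng) -> (child0, child1)
    """
    raw = [fitness(ind) for ind in population]
    order = sorted(range(len(population)), key=lambda i: raw[i], reverse=True)
    ranked = [population[i] for i in order]
    # Assumption: shift fitness so every selection weight is positive.
    floor = min(raw)
    weights = [raw[i] - floor + 1e-9 for i in order]

    new_pop = ranked[:n_elite]  # elitist cloning of the best individuals
    for parent in rng.choices(ranked, weights=weights, k=n_mutate):
        new_pop.append(mutate(parent, rng))
    while len(new_pop) < len(population):
        p0, p1 = rng.choices(ranked, weights=weights, k=2)
        new_pop.extend(crossover(p0, p1, rng))
    return new_pop[:len(population)]
```

Calling this function 1,000 times in sequence, starting from a randomly generated population, corresponds to one evolutionary run under these assumptions.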


Unlike standard EC we did not rely solely on the final population
of rules for our results. EC populations grow increasingly
homogeneous over time making the final population
virtual clones. In this case our goal was to induce a range of
potential rules. We therefore collected candidate rules from
each generation of the run by selecting every rule with
ρ < -0.1. The full set was used in our analysis.


EC-General: Here we conducted a series of twelve evolutionary
runs. Six of the experiments were tailored to induce
positively correlated rules while the rest were tailored to induce
negatively-correlated ones. As with EC-Base the population
size was 100, the algorithm ran for 1,000 generations,
we used ±ρ as the basic fitness metric, and the mutation
and crossover rates were the same as before. Unlike the
EC-Base study these rules also included negated comparison
arcs as well as two generalized node types: nodes that are citations
or claims (CitOrClaim) and nodes that are hypotheses
or claims (HypOrClaim). These elements were chosen
for addition because they were used by the domain experts
when crafting their rules. As before we collected candidate
rules from the positive and negative runs with thresholds
of (ρ > 0.18) and (ρ < -0.1) respectively. These thresholds
were chosen based upon a series of exploratory runs in which
we found that the ρ values became statistically significant
after exceeding ±0.18.


6. RESULTS & ANALYSIS 


Table 2 shows the number of positively and negatively corre- 
lated rules for the Graph grades (columns 3 and 4) and the 
Essay grades (columns 5 and 6) that were collected during 
our experiments. Total designates the total number of rules 
produced by each method or in the expert set, while Threshold
indicates the number for which ρ > 0.18 or ρ < -0.18
in the positive and negative cases respectively.


As Table 2 shows the EC approaches generated the largest 
number of candidate rules in both the positive and negative 
cases. Of the expert rules, most of them were positively 
correlated with performance but less than half of them ex- 
ceeded the cutoff thresholds. Indeed only two of the expert 
rules did so for the essay grades. Both Subdue and gSpan 
identified positively and negatively-correlated rules but only 
a few of the positive rules exceeded the threshold. None of 
the negative rules did so. 


Next, we will describe the rules induced during our EC-Base
experiment and we will discuss how they compare to the
expert rules and the rules induced by Subdue and gSpan. We
will then discuss the EC-General rules and compare them to
our earlier results.


Table 2: Number of Positive and Negative Rules

                         Graph          Essay
Methods                Pos    Neg     Pos    Neg
Subdue   Total          12      2       8     10
         Threshold      11      0       3      0
gSpan    Total          34      5      27     12
         Threshold      12      0       6      0
Expert   Total          56     21      46     32
         Threshold      25      6       0      2
EC-B     Total          82    256     172    160
         Threshold      82     51     172     22
EC-G     Total         394    392     652    518
         Threshold     394    198     652     30

* Threshold: number of rules with ρ > 0.18 or ρ < -0.18


Proceedings of the 9th International Conference on Educational Data Mining


Table 3: Spearman correlation values for the best 3 rules in each experiment.

                 Positive-correlated                          Negative-correlated
          Graph                  Essay                  Graph                  Essay
          1st    2nd    3rd      1st    2nd    3rd      1st    2nd    3rd      1st    2nd    3rd
Subdue   .276   .270   .253     .281   .215   .181    -.050  -.022    NA     -.173  -.167  -.164
gSpan    .352   .314   .272     .300   .281   .261    -.137  -.063  -.05     -.123  -.102  -.075
Expert   .427*  .338   .329     .180   .138   .137    -.238  -.236  -.202    -.256  -.218  -.148
EC-B     .371   .369   .362     .334   .334   .319    -.272  -.272  -.271*   -.233  -.233  -.233
EC-G     .396   .391*  .385*    .357*  .357*  .356*   -.273* -.272* -.270    -.269* -.269* -.269*

The best result for Experiment I is in bold;
‘*’ marks the best result across both Experiment I and II.


(B-G-N)  k*.Type = “claim”
         cs*.Type = “currstudy”

(B-E-N)  k.Type = “claim”
         c.Type = “citation”
         u.Type = “unspecified”


6.1 Experiment I: EC-Base

Rows 1-4 in Table 3 list ρ values for the three best rules
from the four methods. The bold values indicate the best
performing rule among the sets. As the table illustrates EC-B
outperformed both Subdue and gSpan across the board,
and it outperformed the expert rules in most cases. The
lone exceptions are the best positive case for the graph
grades and the best negative case for the essay grades.


The best positively-correlated expert rule for the graph grades 
matched arcs with empty text fields. The best negatively- 
correlated expert rule with the essay grade matched graphs 
with no hypothesis nodes. Both of these rules relied on com- 
plex grammar features, textual rules and expressions, that 
were outside the scope of our current experiments. 


(B-G-P)  k*.Type = “claim”
         h.Type = “hypothesis”
         c*.Type = “citation”
         s*.Type = “supporting”

(B-E-P)  h.Type = “hypothesis”
         c*.Type = “citation”
         s.Type = “supporting”
         o.Type = “opposing”


Figure 4: EC-Base: Strongest Positively-correlated 
Rules Induced by EC. 


Figures 4 and 5 illustrate the best positive and negative rules
induced by the EC-Base runs. In Figure 4 graph rule B-G-P
represents a rule that has 5 nodes, two of which are citations
(c0 & c1) that support a shared claim node (k0). The
remaining nodes are a single claim (k1) and a hypothesis
(h) which may or may not be connected to the rest of the
structure. This reflects a graph where the authors identified
at least two related citations that can be synthesized
to support a single claim and where they included both a
hypothesis and another claim. This is one of the structures
that students have been encouraged to make in their arguments
as it shows an ability to synthesize citations to form
a complex claim.


Figure 5: EC-Base: Strongest Negatively-correlated
Rules Induced by EC.


Interestingly, the best positive essay rule (B-E-P) is very 
closely related to the expert rule shown in Figure 2. Here it 
selects for the presence of a hypothesis node (h) that is directly
connected to two citations (c0 & c1). Here c0 directly
supports h while c1 directly opposes it. Given that the
algorithm could not induce variable arcs it is not surprising
that it does not include paths. The absence of a comparison 
arc, however, is interesting. As we noted above the students 
were instructed to include one. The fact that this rule per- 
forms so well despite lacking one suggests that the students 
did not regularly do so. 


Figure 5 shows the best negative rules. As stated above, we 
expect that these rules will flag errors or persistent structural
flaws. B-G-N consists of 4 claim nodes (k0 to k3) and
two currstudy nodes (cs0 & cs1) all of which may or may
not be connected to one another. While this rule has a high
correlation with the grade, its semantic meaning is unclear.
It is possible that it is detecting overly large graphs that
lack sufficient focus. In future work we will evaluate the
matching graphs with domain experts to assess this. 


B-E-N is easier to interpret. In this case the rule contains a
single claim node (k) which is connected to a citation node
(c) via an undefined arc (u). This is a clear violation of the
semantic guidance that students were given. The students
in the experiment were instructed to use unspecified arcs
for definitions or clarifications only. Some students instead
used them when they were unsure about the strength of their
evidence or did not understand the citation. The students
were also instructed to use citations to add information to
their claims, not the other way around. For a student to
use an unspecified arc in this way suggests that they were
unsure about the structure or content of the argument.


6.2 Experiment II: EC-General 

The last row of Table 3 shows the performance of the EC- 
General rules. These rules were compared against all of the 
rules in Experiment I. The best performing rules across
both experiments are in bold and marked *. As Table 3
shows EC-General produced better performing rules than
EC-Base. All but one of the ρ values on the final row exceed
the corresponding value on the fourth, and the one that does 
not do so falls behind by only 0.001. EC-General outper- 
formed the best negative expert rule for the essay grades 
(-0.269 vs. -0.256), despite the fact that the expert rule 
relied on complex expressions. The best expert rule for the 
graph grade still outperforms EC-General. Thus, our results 
for EC are better than all other methods save for one expert 
rule that relies on novel textual features. 


Figure 6 shows the best positively-correlated rules for the 
graph and essay grades. G-G-P matches cases where a sup- 
porting arc has been drawn from a citation or claim to a 
claim or hypothesis. In short, it matches correct uses of 
supporting arcs. This is a good feature that indicates well- 
supported arguments. G-E-P, by contrast, is complex and 
selects for a graph with three claim nodes (k0 to k2) and two
uncompared citations (c0 & c1), where c1 directly supports
a hypothesis or claim (hk) which in turn has an unspecified 
arc to a citation or claim node (ck). The semantic meaning 
of this rule is unclear and will require deeper analysis. 


Figure 7 shows the strongest negatively-correlated rules. As
with G-E-P, G-G-N is somewhat hard to interpret. It selects
for a number of disjoint nodes, and for the presence
of a currstudy node (cs0) as well as a claim (k3) which are 
not connected via a comparison arc. Further analysis is re- 
quired to determine why this rule holds. G-E-N, by contrast 
represents a clear variation on B-E-N. Here we select for a 
hypothesis or claim node (hk) that has an undefined arc to 
a citation along with a separate hypothesis node that may 
or may not be connected. This rule is interesting because 
in part it will select a superset of the graphs matched by 
B-E-N but the presence of the extra hypothesis node will 
restrict that somewhat. This suggests that this rule may be 
relatively specific to our dataset. We plan to examine the 
matching graphs to assess its generality. 


7. CONCLUSIONS 


In this paper, we reported our work on the automatic induc- 
tion of Augmented Graph Grammars for student-produced 
argument diagrams through EC. In prior work we demon- 
strated that hand-authored expert rules can be empirically- 
valid and that those valid rules can be used for automatic 
grading. We have now shown that it is possible to auto- 
matically induce complex rules for argument diagrams that 
match both positive and negative examples and which can 
therefore be used as features for automatic grading. We have 
also shown that the induced rules outperform all but one
of the expert rules and the rules induced by other general-purpose
grammar induction algorithms. The strongest expert
rule was outside the scope of this experiment.


Figure 6: EC-General: Strongest Positively-correlated Rules Induced by EC.

Figure 7: EC-General: Strongest Negatively-correlated Rules Induced by EC.


In future work we plan to work with domain experts to eval- 
uate these rules. Our goal will be to determine whether the 
rules are semantically valid, and whether or not they can 
serve as the basis for automatic hints. We will also assess 
whether or not the rules can be used for data-driven grading 
by using them as features in a regression model. And finally 
we will expand the scope of our EC induction to include the 
automatic induction of hierarchical rules with expressions 
and complex element constraints. 